Towards Detecting Annotation Errors in Spoken Language Corpora
Abstract
Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), work on detecting errors in syntactic and other structural annotation has begun only recently (Dickinson and Meurers, 2003b; Ule and Simov, 2004). Spoken language differs in many respects from written language, but to the best of our knowledge the issue of error detection in spoken language corpora has not yet been addressed. This is significant because spoken data is increasingly relevant for linguistic and computational research, and such corpora are becoming more readily available. This paper addresses the issue, building on the variation n-gram error detection approach developed in Dickinson and Meurers (2003a). We use the German Verbmobil treebank (Hinrichs et al., 2000) as an exemplar of a spoken language corpus and discuss the properties of such corpora that are relevant when adapting the variation n-gram approach to them.
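As a rough illustration of the idea behind variation n-grams for positional annotation, the sketch below flags a token position when the same n-gram of words recurs in the corpus with different tags at that position. It is a simplified sketch and not the authors' implementation: the function name `variation_ngrams`, the fixed window size `n`, and the toy English tag set are all assumptions made for this example, and the published method may differ in how n-grams are built and filtered.

```python
from collections import defaultdict

def variation_ngrams(corpus, n=3):
    """Find n-gram positions whose tag varies across identical word contexts.

    corpus: list of sentences, each a list of (token, tag) pairs.
    Returns a dict mapping (ngram_tokens, offset) to the set of tags
    observed at that offset, restricted to entries with more than one tag.
    (Hypothetical helper for illustration only.)
    """
    seen = defaultdict(set)
    for sentence in corpus:
        tokens = [tok for tok, _ in sentence]
        tags = [tag for _, tag in sentence]
        # Slide a window of n tokens over the sentence.
        for start in range(len(sentence) - n + 1):
            window = tuple(tokens[start:start + n])
            # Record the tag seen at every position inside this window.
            for offset in range(n):
                seen[(window, offset)].add(tags[start + offset])
    # Keep only positions annotated in more than one way.
    return {key: tags for key, tags in seen.items() if len(tags) > 1}

# Toy data (invented): "can" is tagged inconsistently in the same context.
toy_corpus = [
    [("I", "PRON"), ("can", "MD"), ("go", "VB")],
    [("I", "PRON"), ("can", "NN"), ("go", "VB")],
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
]
candidates = variation_ngrams(toy_corpus, n=3)
```

Each reported entry is only a candidate: variation in an identical context can reflect genuine ambiguity rather than an annotation error, so the flagged positions still need manual inspection.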
Related resources
Detecting Annotation Errors in Spoken Language Corpora
Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other...
Transcribing Speech: Errors in Corpora and Experimental Settings
Administrations, government organs, judiciary courts always faced the problem of defining limits in transcription practices. Nowadays corpus linguistics and computational linguistics have focused their attention on spoken corpora as indispensable tools for descriptive linguistics, as well as for applied purposes (in speech technologies, such as text-to-speech and speech recognition, in dialogue...
What might a corpus of parsed spoken data tell us about language?
This paper summarises a methodological perspective towards corpus linguistics that is both unifying and critical. It emphasises that the processes involved in annotating corpora and carrying out research with corpora are fundamentally cyclic, i.e. involving both bottom-up and top-down processes. Knowledge is necessarily partial and refutable. This perspective unifies ‘corpus-driven’ and ‘theory...
Corpus of Spoken Slovak Language
In this paper a short description of activities towards building a general speech corpus of spoken Slovak language is given. Different rôles and specific features of text corpus and speech corpus are investigated as well as the most frequent mistakes and misunderstandings of the concept of a speech corpus are mentioned. The concept of a big representative corpus of spoken language and its desir...
DECCA Project Description
In the past decade, research and applications in human language technology have strongly been influenced by the success of data-driven and stochastic modeling of natural language based on electronic corpora annotated with linguistic information. Annotated corpora are fundamental for training and testing algorithms in statistical natural language processing, and they are essential as gold standa...
Publication date: 2005